Learning with Multi-modal Gradient Attention for Explainable Composed Image Retrieval
We consider the problem of composed image retrieval that takes an input query
consisting of an image and a modification text indicating the desired changes
to be made on the image and retrieves images that match these changes. Current
state-of-the-art techniques for this problem rely on global features for
retrieval, which leads to incorrect localization of the regions of interest
to be modified, especially for real-world, in-the-wild images. Since modifier
texts usually correspond to
specific local changes in an image, it is critical that models learn local
features to be able to both localize and retrieve better. To this end, our key
novelty is a new gradient-attention-based learning objective that explicitly
forces the model to focus on the local regions of interest being modified in
each retrieval step. We achieve this by first proposing a new visual attention
computation technique, multi-modal gradient attention (MMGrad), that is
explicitly conditioned on the modifier text. We next
demonstrate how MMGrad can be incorporated into an end-to-end model training
strategy with a new learning objective that explicitly forces these MMGrad
attention maps to highlight the correct local regions corresponding to the
modifier text. By training retrieval models with this new loss function, we
show improved grounding by means of better visual attention maps, leading to
better explainability of the models as well as competitive quantitative
retrieval performance on standard benchmark datasets.
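The abstract does not spell out how MMGrad is computed, so the following is only a minimal sketch of the general idea of a gradient-based attention map conditioned on text. It applies a Grad-CAM-style weighting to a toy image-text similarity score. The function name `text_conditioned_grad_attention`, the linear projection `proj`, and the score itself are assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def text_conditioned_grad_attention(feat, proj, text_emb):
    """Grad-CAM-style map for the toy score s = <proj @ GAP(feat), text_emb>.

    feat:     (C, H, W) convolutional feature map of the query image
    proj:     (D, C) hypothetical linear projection into the joint space
    text_emb: (D,) hypothetical embedding of the modifier text
    """
    C, H, W = feat.shape
    # Analytic gradient of s w.r.t. feat[c, h, w]: (proj^T text_emb)[c] / (H * W).
    grad = np.broadcast_to(
        (proj.T @ text_emb)[:, None, None] / (H * W), feat.shape
    )
    # Channel weights are the spatially averaged gradients, as in Grad-CAM.
    alpha = grad.mean(axis=(1, 2))
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = np.maximum(np.tensordot(alpha, feat, axes=1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy usage: two channels that fire at different locations.
feat = np.zeros((2, 4, 4))
feat[0, 1, 2] = 1.0   # channel 0 responds at row 1, col 2
feat[1, 3, 0] = 1.0   # channel 1 responds at row 3, col 0
proj = np.eye(2)
attn = text_conditioned_grad_attention(feat, proj, np.array([1.0, 0.0]))
```

In this toy setup, changing `text_emb` changes which channel receives weight, so the highlighted region moves with the text. A training objective of the kind described above would then penalize maps that miss the region being modified.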
Learning Compositional Visual Concepts with Mutual Consistency
Compositionality of semantic concepts in image synthesis and analysis is
appealing as it can help in decomposing known and generatively recomposing
unknown data. For instance, we may learn concepts of changing illumination,
geometry or albedo of a scene, and try to recombine them to generate physically
meaningful, but unseen data for training and testing. In practice, however, we
often do not have samples from the joint concept space available: we may have
data on illumination change in one data set and on geometric change in another
one without complete overlap. We pose the following question: How can we learn
two or more concepts jointly from different data sets with mutual consistency
where we do not have samples from the full joint space? We present a novel
answer in this paper based on cyclic consistency over multiple concepts,
represented individually by generative adversarial networks (GANs). Our method,
ConceptGAN, can be understood as a drop-in for data augmentation to improve
resilience for real-world applications. Qualitative and quantitative
evaluations demonstrate its efficacy in generating semantically meaningful
images, as well as in one-shot face verification as an example application.
Comment: 10 pages, 8 figures, 4 tables, CVPR 201
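To make "cyclic consistency over multiple concepts" concrete, here is a minimal sketch with two toy concept shifts standing in for trained generators. The callables `g1`, `f1`, `g2`, `f2` and the exact form of the longer cycle are assumptions chosen for illustration. The abstract does not state the ConceptGAN losses, so this should not be read as the paper's objective.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference between two arrays."""
    return float(np.mean(np.abs(a - b)))

def cycle_losses(x, g1, f1, g2, f2):
    """Cycle-consistency terms for two concepts.

    g_i applies concept i and f_i removes it (all hypothetical callables).
    Returns (single, joint): per-concept round trips, and a longer cycle
    that passes through the jointly shifted sample g2(g1(x)), for which
    no real data is assumed to exist.
    """
    single = l1(f1(g1(x)), x) + l1(f2(g2(x)), x)
    joint = l1(f2(f1(g2(g1(x)))), x)
    return single, joint

# Toy stand-ins: concept 1 brightens, concept 2 mirrors left-right.
def brighten(x): return x + 0.5
def darken(x): return x - 0.5
def mirror(x): return x[:, ::-1]

x = np.arange(6, dtype=float).reshape(2, 3) / 10.0
single, joint = cycle_losses(x, brighten, darken, mirror, mirror)
```

With exact inverses both terms are numerically zero. During training, terms like these would be minimized alongside adversarial losses so that each generator stays consistent with the others.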
Towards Visually Explaining Variational Autoencoders
Recent advances in Convolutional Neural Network (CNN) model interpretability
have led to impressive progress in visualizing and understanding model
predictions. In particular, gradient-based visual attention methods have driven
much recent effort in using visual attention maps as a means for visual
explanations. A key problem, however, is that these methods are designed for
classification and categorization tasks, and their extension to explaining
generative models, e.g., variational autoencoders (VAEs), is not trivial. In this
work, we take a step towards bridging this crucial gap, proposing the first
technique to visually explain VAEs by means of gradient-based attention. We
present methods to generate visual attention from the learned latent space, and
also demonstrate that such attention explanations serve more than just explaining
VAE predictions. We show how these attention maps can be used to localize
anomalies in images, demonstrating state-of-the-art performance on the MVTec-AD
dataset. We also show how they can be infused into model training, helping
bootstrap the VAE into learning improved latent space disentanglement,
demonstrated on the dSprites dataset.
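One way to picture "visual attention from the learned latent space" is to backpropagate each latent mean to an encoder feature map and aggregate the resulting maps. The sketch below does this for a toy encoder whose latent means are a linear function of globally average-pooled features. The head `w_mu`, the function name, and the mean aggregation are assumptions for illustration, since the abstract does not give the exact procedure.

```python
import numpy as np

def latent_attention_maps(feat, w_mu):
    """Per-latent-dimension gradient attention for a toy encoder.

    feat: (C, H, W) encoder feature map
    w_mu: (K, C) hypothetical linear head, mu = w_mu @ GAP(feat)
    Returns (per_dim, aggregate) with shapes (K, H, W) and (H, W).
    """
    C, H, W = feat.shape
    # d mu_k / d feat[c, h, w] = w_mu[k, c] / (H * W) at every position,
    # so the spatially averaged gradient (the channel weight) is the same value.
    alpha = w_mu / (H * W)                                   # (K, C)
    # One ReLU-rectified weighted sum of channels per latent dimension.
    per_dim = np.maximum(np.einsum("kc,chw->khw", alpha, feat), 0.0)
    # Aggregate across latent dimensions by a simple mean (an assumption).
    return per_dim, per_dim.mean(axis=0)

# Toy usage: three latent dimensions reading from two channels.
feat = np.zeros((2, 3, 3))
feat[0, 0, 0] = 2.0
feat[1, 2, 1] = 1.0
w_mu = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
per_dim, aggregate = latent_attention_maps(feat, w_mu)
```

For anomaly localization, an aggregate map like this could be thresholded to flag regions that drive the latent code. The threshold and any normalization would need to be chosen for the data at hand.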